Detecting Patterns in the LSI Term-Term Matrix
Authors
Abstract
Higher order co-occurrences play a key role in the effectiveness of systems used for text mining. A wide variety of applications use techniques that explicitly or implicitly employ a limited degree of transitivity in the co-occurrence relation. In this work we show the use of higher orders of co-occurrence in the Singular Value Decomposition (SVD) algorithm and, by inference, in the systems that rely on SVD, such as LSI. Our empirical and mathematical studies show that term co-occurrence plays a crucial role in LSI. This work is the first to study the values produced in the truncated term-term matrix, and we have discovered an explanation for why certain term pairs receive a high similarity value, while others receive low (and even negative) values. Thus we have discovered the basis for the claim that is frequently made for LSI: LSI emphasizes important semantic distinctions (latent semantics) while reducing noise in the data. The correlation between the number of connectivity paths between terms and the value produced in the truncated term-term matrix is another important component in the theoretical foundation for LSI. Patterns we discover in the LSI term-term matrix will be used, in future work, to develop an approximation algorithm for LSI. Our goal is to approximate the LSI term-term matrix using a faster algorithm. This matrix can then be used in place of the LSI matrix in a variety of applications, such as our unsupervised term clustering algorithm.
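The truncated term-term matrix the abstract refers to can be sketched in a few lines of NumPy. The toy 5-term by 4-document matrix below is invented purely for illustration; the construction T_k = U_k S_k² U_kᵀ is the standard LSI term-term matrix, but the corpus and the choice of k are assumptions:

```python
import numpy as np

# Hypothetical 5-term x 4-document term-frequency matrix A.
# Rows are terms t0..t4; columns are documents d0..d3.
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

# SVD of the term-document matrix: A = U S Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Truncate to k dimensions -- the dimensionality reduction step of LSI.
k = 2
Uk, sk = U[:, :k], s[:k]

# Truncated term-term matrix: T_k = U_k S_k^2 U_k^T,
# equivalently A_k A_k^T where A_k is the rank-k approximation of A.
Tk = Uk @ np.diag(sk**2) @ Uk.T

# Entry Tk[i, j] is the similarity LSI assigns to terms i and j.
# It can be nonzero (or even negative) for term pairs that never
# co-occur in any document, i.e. where (A @ A.T)[i, j] == 0 --
# this is where higher order co-occurrence enters.
print(np.round(Tk, 3))
```

The symmetric matrix `Tk` has rank at most k, so it mixes information across terms rather than recording raw co-occurrence counts.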
Similar Resources
Analysis of the values in the LSI Term-Term Matrix
Singular value decomposition (SVD), the process at the heart of Latent Semantic Indexing (LSI), is a computationally expensive procedure. In this paper we analyze the relationship between higher order term co-occurrence and the values produced by the LSI process. We show a strong correlation between the number of co-occurrence paths and the value produced in the LSI term-term matrix.
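The "co-occurrence paths" this abstract correlates with LSI values can be counted directly from the term-document matrix. The sketch below (toy data invented here; the path-counting convention via powers of the adjacency matrix is an assumption about what the paper measures) finds term pairs connected only at second order:

```python
import numpy as np

# Hypothetical binary term-document incidence matrix
# (5 terms t0..t4, 4 documents d0..d3).
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=int)

# First-order co-occurrence: two terms share at least one document.
C = (A @ A.T) > 0
np.fill_diagonal(C, False)  # ignore trivial self co-occurrence

# Number of length-2 connectivity paths between terms i and j:
# intermediate terms m with i~m and m~j in the co-occurrence graph.
C_int = C.astype(int)
paths2 = C_int @ C_int

# Second-order co-occurrence: no direct co-occurrence, but at least
# one two-step path through a shared neighbor term.
second_order = (~C) & (paths2 > 0)
np.fill_diagonal(second_order, False)
```

Here `second_order[0, 3]` is true: t0 and t3 never appear in the same document, yet both co-occur with t1, so LSI can still assign the pair a meaningful similarity.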
A Mathematical View of Latent Semantic Indexing: Tracing Term Co-occurrences
Current research in Latent Semantic Indexing (LSI) shows improvements in performance for a wide variety of information retrieval systems. We propose the development of a theoretical foundation for understanding the values produced in the reduced form of the term-term matrix. We assert that LSI’s use of higher orders of co-occurrence is a critical component of this study. In this work we present...
A Latent Semantic Structure Model for Text Classification
Latent Semantic Indexing (LSI) has been successfully applied to information retrieval and classification. LSI can deal with the problems of polysemy and synonymy, and can reduce noise in the raw document-term matrix. However, LSI may ignore important features for some small categories because they are not the most important features for all the document collection. In this paper, we describe a ...
A Similarity-based Probability Model for Latent Semantic Indexing
A dual probability model is constructed for Latent Semantic Indexing (LSI) using the cosine similarity measure. Both the document-document similarity matrix and the term-term similarity matrix naturally arise from the maximum likelihood estimation of the model parameters, and the optimal solutions are the latent semantic vectors of LSI. Dimensionality reduction is justified by the statist...
Assessing the Impact of Sparsification on LSI Performance
We describe an approach to information retrieval using Latent Semantic Indexing (LSI) that directly manipulates the values in the Singular Value Decomposition (SVD) matrices. We convert the dense term-by-dimension matrix into a sparse matrix by removing a fixed percentage of the values. We present retrieval and runtime performance results, using seven collections, which show that using this tec...
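Removing a fixed percentage of the values from a dense SVD matrix can be sketched as follows. This is a minimal, assumed reading of the sparsification idea (dropping the smallest-magnitude entries); the paper's exact selection criterion and the example matrix are not taken from the abstract:

```python
import numpy as np

def sparsify(M, drop_fraction=0.5):
    """Zero out the smallest-magnitude entries of M.

    Keeps roughly the largest (1 - drop_fraction) of the entries by
    absolute value. A sketch of value-removal sparsification, not
    the paper's exact procedure.
    """
    out = M.copy()
    flat = np.abs(out).ravel()
    n_drop = int(drop_fraction * flat.size)
    if n_drop == 0:
        return out
    # Magnitude of the n_drop-th smallest entry; everything at or
    # below it is zeroed (ties may remove slightly more entries).
    threshold = np.partition(flat, n_drop - 1)[n_drop - 1]
    out[np.abs(out) <= threshold] = 0.0
    return out

# Example: a small dense term-by-dimension matrix (values invented).
M = np.array([[3.0, 1.0],
              [0.5, 2.0]])
sparse_M = sparsify(M, drop_fraction=0.5)  # zeros the two smallest entries
```

The resulting matrix can be stored in a sparse format, trading a controlled loss of precision for reduced memory and faster query-time products.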